Exploring Complex Networks through Random Walks 
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Most real complex networks - such as protein interactions, social contacts, the internet - are only 
partially known and available to us. While the process of exploring such networks in many cases 
resembles a random walk, it becomes a key issue to investigate and characterize how effectively the 
nodes and edges of such networks can be covered by different strategies. At the same time, it is 
critically important to infer how well can topological measurements such as the average node degree 
and average clustering coefficient be estimated during such network explorations. The present article 
addresses these problems by considering random and Barabasi-Albert (BA) network models with 
varying connectivity explored by three types of random walks: traditional, preferential to untracked 
edges, and preferential to unvisited nodes. A series of relevant results are obtained, including the 
fact that random and BA models with the same size and average node degree allow similar node 
and edge coverage efficiency, the identification of linear scaling with the size of the network of the 
random walk step at which a given percentage of the nodes/edges is covered, and the critical result 
that the estimation of the averaged node degree and clustering coefficient by random walks on BA 
networks often leads to heavily biased results. Many are the theoretical and practical implications 
of such results. 

PACS numbers: 05.40.Fb,89.75.Hc,07.05.Mh 



The crew of the caravel 'Nina ' also saw signs of land, 
and a small branch covered with berries. Everyone 
breathed afresh and rejoiced at these signs. 

(Christopher Columbus) 



I. INTRODUCTION 

Despite its relatively young age, the area of investiga- 
tion going by the name of complex networks |j]. U. y. U. l| 

has established itself as a worthy relative - or perhaps 
inheritor - of graph theory and statistical physics. Such 
a success has been a direct consequence of the empha- 
sis which has been given to structured interconnectivity, 
statistical formulations, interest in applications and, as in 
more recent developments (e.g. fcJIH), the all-important 
paradigm relating structure and dynamics. Yet, almost 
invariably, the analyzed networks are assumed to be com- 
pletely known and accessible to us. Indeed, while so 
many important problems involving completely described 
networks - such as community finding (e.g. |fj) - remain 
as challenges in this area, why should one bother to con- 
sider incompletely specified networks? 

Perhaps a good way to start making sense of this ques- 
tion is by considering our future. To what restaurant are 
we going tomorrow? What article will we read next? 
Which mirrors will ever see our faces again? Would not 
each such situation be describable as a node, while the 
flow of decisions among the possibilities would define 
a most extraordinary personal random walking a most 
complex network? Although such a dynamic network is 
undoubtedly out there (or in here), we are allowed to 
explore just a small portion of it at a time. And, with 



basis on whatever knowledge we can draw from such a 
small sample, we have to decide about critical next steps. 
However, the situations involving incomplete or sampled 
networks extend much further than this extreme exam- 
ple. For instance, the steps along any game or maze is but 
a sample of a much larger network of possibilies. Explo- 
rations of land, sea and space also correspond to small 
samplings of a universe of possibilities, not to mention 
more 'classical' large networks such as those obtained for 
protein interaction, social contacts and the Internet. Last 
but not least, the own exploratory activities of science 
are but a most complex random walk on the intricate 
and infinite web of knowledge Q • In all such cases, the 
success of the whole enterprise is critically connected to 
the quality and accuracy of the information we can infer 
about the properties of the whole network while judg- 
ing from just a small sample of it. Little doubt can be 
raised about the importance of such a problem, which 
has received little attention, except for the investigations 
by Serrano et al. on the effects of sampling the WWW 
through crawlers [8| . The literature about random walks 
in complex networks include 0, UlA UJ1> LL2| ■ 



The current article is about incomplete and sampled 
networks and some related fundamental questions. We 
start with the basic mathematical concepts, identifying 
some of the most interesting related questions and per- 
spectives, and proceed by illustrating what can be learnt 
about random and Barabasi-Albert networks while sam- 
pling them locally in random fashion or through three 
types of random walks — traditional (uniform decision 
probability) and preferential to untracked edges and pref- 
erential to untracked nodes). 



II. BASIC CONCEPTS AND SOME 
FUNDAMENTAL ISSUES 

An undirected [l]| complex network T = (Cl,E), in- 
volving a set of N nodes and a set V of E connec- 
tions between such nodes, can be completely specified 
in terms of its N x N adjacency matrix K, such that 
the existence of a connection between nodes i and j im- 
plies K(i,j) — K(j,i) — 1 (zero is assigned otherwise). 
The degree ki of node i is defined as corresponding to 
the number of edges connected to that node, and the 
clustering coefficient cci of that node is calculated as the 
number of edges between the neighbors of i and the max- 
imum possible number of interconnections between these 
neighbors (e.g. 0|). 

An incompletely specified complex network is hence- 
forth understood as any subnetwork G of T such that 
G / T. In this work we will restrict our atten- 
tion to incomplete complex networks defined by sets 
of nodes and adjacent edges identified during random 
walks. Such networks can be represented as G = 
((h, Vi); (t'2, V 2 ), . . . , (i&f) Vkf))) where i p are nodes sam- 
pled during the random walk through the respective list 
of adjacent nodes. Note that necessarily i p+ i £ V p and 
that (ii, «2, ■ ■ ■ , *m) corresponds to a path along I\ It is 
also interesting to consider more substantial samples of T, 
for instance by considering not only the adjacent edges, 
but also the interconnections between the neighboring 
nodes of each node. Therefore, the case above becomes 
G = ({ii,Vi,Ei);(i2,V2,E2),...,(iM,VM,E M )), where 
E p is the set containing the edges between the nodes in 
V p . Figure Q] illustrates a complex network (a) and re- 
spective examples of incompletely specified networks ob- 
tained by random walks considering neighboring nodes 
(b) and the latter plus the edges between neighboring 
nodes (c). 

Given an incomplete specified complex network G, a 
natural question which arises is: to what accuracy the 
properties of the whole network T can be inferred from 
the available sampled information? Because the estima- 
tion of global properties of T such as shortest paths and 
diameter constitutes a bigger challenge to the moving 
agent, we concentrate our attention on local topological 
properties, more specifically the node degree and cluster- 
ing coefficient averaged over the network. 

Three types of random walks are considered in the 
present work: (i) 'traditional': the next edge is chosen 
with uniform probability among the adjacent edges; (ii) 
preferential to untracked edges: the next edge is cho- 
sen among the untracked adjacent edges and, in case no 
such edges exist, uniformly among all the adjacent edges; 
and (iii) preferential to unvisited nodes: the next edge is 
chosen among those adjacent edges leading to unvisited 
nodes and, in case no such edges exist, uniformly among 
all the adjacent edges. Note that the plausibility of the 
preferential schemes depends on each modeled system. 
For instance, the preference to untracked nodes implies 
that the moving agent knows whether each edge leads to 




(a) 



(b) 



(c) 



FIG. 1: A simple network (a) and two incompletely specified 
networks obtained by a random walk considering neighboring 
nodes and (b) and the latter plus the edges between adjacent 
nodes (c). The gray nodes correspond to those sampled during 
the random walk 



a still unvisited node, though it may not know exactly 
which one. It is interesting to note that the process of 
sampling an existing network through a random walk can 
be interpreted as a mechanism for 'growing' a network. 



III. NODE AND EDGE COVERAGE 

First we consider networks growth according to one 
of the following two complex network models: (a) ran- 
dom networks, where each pair of nodes has a probability 
A of being connected; and (b) Barabdsi- Albert networks 
(BA), built by using the preferential attachment scheme 
described in [l|. More specifically, new nodes with m 
edges each are progressively incorporated into the grow- 
ing network, with each of the m edges being attached to 
previous nodes with probability proportional to their re- 
spective node degrees. The network starts with mO = m 
nodes. Complex networks with number of nodes N equal 
to 100, 200, . . . , 900 and m = 3, 4, . . . , 8 have been consid- 
ered. An equivalent average degree and number of edges 
was imposed onto the random networks as in pj. A to- 
tal of 200 realizations of each configuration, for the three 
types of random walks, were performed experimentally. 

Figure |21 illustrates the ratio of tracked edges in terms 



of the steps t for N — 100 and N — 300 considering 
to = 3, 4, . . . , 8 and the BA network model. It is clear 
from the obtained results that, as expected, the higher 
the value of m, the smaller the ratio of visited edges. 
Note that the increase of N also contributes to less effi- 
cient coverage of the edges, as expressed by the respective 
smaller ratios of visited edges obtained for N = 300. For 
large enough total number of steps, all curves exhibited 
an almost linear initial region followed by saturation near 
the full ratio of visited edges (i.e. 1). 




FIG. 2: The ratio of tracked edges in terms of the steps t 
for TV = 100 and N = 300 considering the values of m as 
presented in the legend. 



Figure |3{a) shows the 'quarter-lives' h of the percent- 
age of visited nodes in terms of the network size N with 
respect to the BA network model with to = 5. This mea- 
surement corresponds to the average number of steps at 
which the random walk has covered a quarter of the to- 
tal number of network nodes. Similar results have been 
obtained for other critical fractions (e.g. half-life). Note 
that, as to is fixed at 5, the average degree (k) of all 
networks in this figure remains equal to 10 |l4|. being 
therefore constant with TV, while the average number of 
edges grows as (E) = N (k) /2 = 5N. Interestingly, lin- 
ear dependence between the quarter-lives and N are ob- 
tained in all cases. It is also clear from these results 
that the most effective coverage of the nodes is obtained 
by the random walk preferential to unvisited nodes, with 
the random walk preferential to untracked edges present- 
ing the next best performance. Figure |3Jb) shows the 
quarter-lives of the percentage of visited nodes for ran- 
dom networks. The random walks with preference to 
unvisited nodes again resulted in the most effective cov- 
ering strategy. The quarter-lives for the percentage of 
tracked edges are shown in Figures[3{c,d) respectively to 
BA (c) and random (d) network models. The best ra- 
tios of covered edges were obtained for the random walk 
preferential to untracked edges, with the random walk 
preferential to unvisited edges presented the next best 
performance. The traditional random walk resulted the 
least efficient strategy in all situations considered in this 



work. Note that the three types of random walks resulted 
slightly more effective for node coverage in the case of the 
random model than in the BA counterparts. 

Further characterization of the dynamics of node cov- 
erage can be obtained by considering the scaling of the 
slopes of the curves of ratios of visited nodes in terms of 
several values of m. Remarkably, the slopes obtained by 
least mean square fitting for the two types of preferential 
random walks were verified not to vary with m, being 
fixed at 0.31 and 0.25, respectively, for the BA model 
and 0.285 and 0.25 for the random networks. Figure 01 
shows the log-log representation of the slopes in terms of 
to obtained for the traditional random walk on BA and 
random networks for m=3,4,...,8. It is clear from 
this figure that, though the slopes tend to scale in simi- 
lar (almost linear) fashion for the two types of considered 
networks, the ratios of node coverage increase substan- 
tially faster for the random networks. 



IV. ESTIMATION OF AVERAGE NODE 
DEGREE AND CLUSTERING COEFFICIENT 

So far we have investigated the dynamics of node and 
edge coverage in random and BA models while consid- 
ering the three types of random walks. In practice, as 
the size of the network being explored through the ran- 
dom walks is typically unknown, the number of visited 
nodes or tracked edges by themselves provide little infor- 
mation about the topological properties or nature of the 
networks. The remainder of the present work addresses 
the estimation of measurements of the local connectivity 
of networks, namely the average node degree and average 
clustering coefficient, obtalined along the random walks. 

For generality's sake, the estimations are henceforth 
presented in relative terms, i.e. as the ratio between the 
current estimation (e.g. (k(t)) and the real value (e.g. 
(k)). Figure|S|illustrates the curve defined by the estima- 
tions (k(t),cc(t)) obtained by traditional random walks 
along a BA network with A?" = 800 and to = 5. Interest- 
ingly, this curve is indeed a kind of random walk with con- 
vergent limit. Such curves have been found to converge to 
limiting ratios (&l,ccl) which can or not correspond to 
the ideal ratios (1, 1). In the case of the curve in Figure|31 
we have (fc^ = 1.62, ccl — 0.88), i.e. the average node de- 
gree has been over-estimated while the average clustering 
coefficient has been under-estimated. Through extensive 
simulations, we have observed that the estimations of av- 
erage node degree tended to be about twice as much as 
the real value while the average clustering coefficient re- 
sulted about 0.9 of the real value, irrespectively of N, 
to or random walk type. Contrariwise, the estimation 
of these two topological features for random networks 
tended to produce stable and accurate estimation of the 
average clustering coefficient, while the obtained aver- 
age node degree presented very small variation around 
the optimal value. The substantial biases implied by the 
random walk over BA networks is a direct consequence 
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FIG. 3: The quarter-life of the percentage of visited nodes 
for BA (a) and random (b), and the quarter- life of the per- 
centage of visited edges for BA (c) and random (d) models, 
for traditional ('+')> preferential to untracked edges ('x') and 
preferential to unvisited nodes ('©') random walk strategies. 
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FIG. 4: Loglog (Neper) representation of the slopes of the 
ratios of visited nodes obtained for traditional random walks 
for m = 3, 4, . . . , 8 considering BA ('x') and random ('©') 
network models. 



of the larger variability of node degree exhibited by this 
model. Therefore, nodes with higher degree will tend to 
be visited more frequently |15J , implying over-estimation 
of the average node degree and a slighted bias on the 
clustering coefficient. 

Provided the moving agent can store all the informa- 
tion obtained from the network as it is being explored, 
yielding a partial map of the so far sampled structure, it 
is possible to obtain more accurate (i.e. unbiased) esti- 
mates of the average node degree and clustering coeffi- 
cient during any of the considered random walks in any 
type of networks by performing the measurements with- 
out node repetition. However, an agent moving along 
a BA network without resource to such an up to dated 
partial map will have to rely on averages of the mea- 
surements calculated at each step. This will cause the 
impression of inhabiting a network much more complex 
(in the sense of having higher average node degree) than 
it is indeed the case. Going back to the motivation at the 
beginning of this article, it is difficult to avoid speculat- 
ing whether our impression of living in a world with so 
many possibilities and complexities could not be in some 
way related to the above characterized effects. 



V. CONCLUDING REMARKS 

The fact that most real complex networks are only par- 
tially available to us as a consequence of their sheer size 
and complexity, it becomes of critical importance to un- 
derstand how well these structures can be investigated by 
using sampling strategies such as different types of ran- 
dom walks. The present work has addressed this issue 
considering random and BA network models with vary- 
ing connectivity and sizes being sampled by three types 
of random walks. A series of important results have been 
obtained which bear several theoretical and practical im- 




FIG. 5: Curve (actually a kind of random walk) defined by the 
estimations (k(t) , cc(t)) , through traditional random walk, of 
the average node degree k(t) and average clustering coeffi- 
cient cc(t) in a BA network with N = 800 and m = 5. The 
curve converges to the incorrect estimations ratios (f .62, 0.88) 
instead of (I, I). 



plications. Particularly surprising is the fact that random 
and BA networks are similarly accessible as far as node 
and edge exploration is concerned. Actually, random net- 
works tend to have their nodes and edges explored in a 
slightly more effective way. Also important is the char- 
acterization of linear scaling with the network size of the 
quarter-life of the ratio of covered nodes and edges, and 
the identification substantial biases in estimations of the 
average node degree and clustering coefficient in several 
situations. In particular, the average node degree tend to 
be estimated as being approximately twice as large as the 
real value. Additional insights about the non-trivial dy- 
namics of complex network exploration through random 
walks can be achieved by considering other network mod- 
els as well as more global topological measurements such 
as shortest paths, diameters, hierarchical measurements 
and betweeness centrality. 
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The results in this article are immediately extended to 

more general networks, including directed and weighted 

models. 

In the BA model, the average degree is equal to 1m. 

Actually the rate of visits to the nodes of an undirected 

complex network, at thermodynamical equilibrium, can 

be verified to be proportional to the node degree. 



